Overview

Give a brief description of your project and its goal(s), what data you are using to complete it, and what three faculty/staff in different fields you have spoken to about your project, with a brief summary of what you learned from each person. Include a link to your final project GitHub repository.

This project considers the burdens of poor air quality and its health consequences, and how they fall on different American communities. To explore and understand these relationships, I use daily site-level testing data from the EPA’s Air Quality System for small particulate matter (PM2.5) from 2019, linked with census data for the tract surrounding each testing site. I am still determining how I can best incorporate data on the near-term health effects of poor air quality, although my main goal is not to use air quality to predict those health outcomes, but rather to determine whether there are differences (e.g. in extent/significance) between how community characteristics predict air quality and how those same characteristics predict the prevalence of events like emergency department presentations for asthma exacerbation. In developing a plan for this project I met with Drs. Anil Vachani, Gary Weissman, and John P. Reilly, who provided insights on the causal pathway from pollution sources to specific EPA-measured pollutants to short- and long-term health outcomes; potential data sources and analytic approaches; the ways in which “natural experiments” related to major changes in pollutant generation have been used to compare the effects of perturbation from a steady state; and more. Materials for this project can be found in this GitHub repository.

Introduction

Describe the problem addressed, its significance, and some background to motivate the problem. Explain why your problem is interdisciplinary, what fields can contribute to its understanding, and incorporate background related to what you learned from meeting with faculty/staff.

Poor air quality can substantially harm health and quality of life, but people weigh different resources, incentives, and tradeoffs when deciding where to live. The relationship between who lives where and air quality is therefore likely more complex than people with more wealth, income, and social power simply concentrating where the air is better. By considering many different community characteristics in building models to predict poor air quality, we can learn which factors are most strongly associated with air quality and how well models trained on community characteristics predict both poor air quality and related health events.

Methods

Describe the data used and general methodological approach. Subsequently, incorporate full R code necessary to retrieve and clean data, and perform analysis. Be sure to include a description of code so that others (including your future self) can understand what you are doing and why.

I begin by defining a function to create a data frame from the results of a GET call to the EPA’s Air Quality System API. I create objects with relevant reference lists and define a vector for the “lower 48” U.S. states and Washington, D.C. for later use.

# Assumes library() calls for the packages used throughout (dplyr, tidyr, tibble,
# stringr, httr, jsonlite, sf, tigris, tidycensus, leaflet) have already been run

# Retrieve environment variables with EPA AQS API credentials
api.email <- Sys.getenv("EPA_AQS_EMAIL")
api.key   <- Sys.getenv("EPA_AQS_KEY")

# Function to create a data frame based on an API call
download.AQS <- function(whichData,customParams) { 
  rootpath <- "https://aqs.epa.gov/data/api/"
  morepath <- paste0(rootpath,whichData)
  fullParams <- append(list(email = api.email, key = api.key),customParams)
  step1.json <- httr::GET(url = morepath, query = fullParams)
  step2.parsed <- jsonlite::fromJSON(httr::content(step1.json,"text"), simplifyVector = FALSE)
  step3.df <- tibble(Data = step2.parsed$Data) %>% unnest_wider(Data)
  return(step3.df)
}

# Fetch documentation for daily data summaries, state code list
dailyData.dictionary <- download.AQS("metaData/fieldsByService",list(service = "dailyData"))
list.states <- download.AQS("list/states",list())

# Omitting state codes other than 48 states + D.C.
list.states.lower48 <- list.states %>% filter(!(code %in% c("02","15","66","72","78","80","CC")))
list.states.lower48
lower48.codes <- list.states.lower48 %>% pull(code)

To manage downloading daily PM2.5 data for each state for all of 2019, I define another function that loops through the state FIPS codes for given date parameters. Executing this step in monthly batches (one call per state per month) takes roughly 90 minutes, so I save the resulting data frame and reload it in a later chunk.

# Pull daily PM2.5 (parameter 88101) data for these states, with arguments for date range
get.daily.lower48 <- function(b,e){
  collector <- data.frame()
  for (s in seq_along(lower48.codes)) {
    newstate <- download.AQS("dailyData/byState", list(param="88101", bdate=b, edate=e, state=lower48.codes[s]))
    collector <- bind_rows(collector,newstate)
  }
  return(collector)
}

# Run in monthly batches
dailyPM2.5_201901 <- get.daily.lower48("20190101","20190131")
dailyPM2.5_201902 <- get.daily.lower48("20190201","20190228")
dailyPM2.5_201903 <- get.daily.lower48("20190301","20190331")
dailyPM2.5_201904 <- get.daily.lower48("20190401","20190430")
dailyPM2.5_201905 <- get.daily.lower48("20190501","20190531")
dailyPM2.5_201906 <- get.daily.lower48("20190601","20190630")
dailyPM2.5_201907 <- get.daily.lower48("20190701","20190731")
dailyPM2.5_201908 <- get.daily.lower48("20190801","20190831")
dailyPM2.5_201909 <- get.daily.lower48("20190901","20190930")
dailyPM2.5_201910 <- get.daily.lower48("20191001","20191031")
dailyPM2.5_201911 <- get.daily.lower48("20191101","20191130")
dailyPM2.5_201912 <- get.daily.lower48("20191201","20191231")
 
# Concatenate monthly batch files into a single file for 2019
all.2019.dailyPM2.5 <- bind_rows( dailyPM2.5_201901
                                 ,dailyPM2.5_201902
                                 ,dailyPM2.5_201903
                                 ,dailyPM2.5_201904
                                 ,dailyPM2.5_201905
                                 ,dailyPM2.5_201906
                                 ,dailyPM2.5_201907
                                 ,dailyPM2.5_201908
                                 ,dailyPM2.5_201909
                                 ,dailyPM2.5_201910
                                 ,dailyPM2.5_201911
                                 ,dailyPM2.5_201912)

# Save full 2019 raw file
saveRDS(all.2019.dailyPM2.5, file = "all2019dailyPM25.RDS")
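The twelve hard-coded monthly calls above could also be generated programmatically. A minimal sketch in base R; `month.bounds` is a hypothetical helper, not part of the pipeline above:

```r
# Build "YYYYMMDD" start/end date strings for each month of a year
month.bounds <- function(year) {
  starts <- seq(as.Date(paste0(year, "-01-01")), by = "month", length.out = 12)
  ends   <- seq(as.Date(paste0(year, "-02-01")), by = "month", length.out = 12) - 1
  data.frame(bdate = format(starts, "%Y%m%d"),
             edate = format(ends, "%Y%m%d"),
             stringsAsFactors = FALSE)
}

bounds.2019 <- month.bounds(2019)
bounds.2019$bdate[1]   # "20190101"
bounds.2019$edate[2]   # "20190228"
```

Each row could then drive one call, e.g. `Map(get.daily.lower48, bounds.2019$bdate, bounds.2019$edate)` followed by a single `bind_rows`, in place of the twelve separate assignments.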

I examine combinations of testing site location identifiers to find one that uniquely identifies a site, then add a single site_id variable to the 2019 PM2.5 data set by concatenating those identifiers. I also create a data frame with one observation per testing site.

# Load the full 2019 raw file
all.2019.dailyPM2.5 <- readRDS("all2019dailyPM25.RDS")
str(all.2019.dailyPM2.5)

# Create a file of all unique combinations of geographic indicators for testing sites
testing.sites <- all.2019.dailyPM2.5 %>%
                      dplyr::select(state_code, state, 
                             county_code, county,
                             city, cbsa_code, cbsa,
                             site_number, local_site_name, site_address,
                             latitude, longitude) %>%
                      unique() %>%
                      arrange(state_code,county_code,site_number)
dim(testing.sites)
# 965 combinations

# Create a single unique ID column for site
testing.sites %>% dplyr::select(site_number) %>% unique() %>% dim()
testing.sites %>% dplyr::select(state_code,site_number) %>% unique() %>% dim()
testing.sites %>% dplyr::select(state_code,county_code,site_number) %>% unique() %>% dim()
# Only 253 unique site numbers. 770 state-site combos. 
# State + county + site number *is* unique (965 combos)

clean.2019.dailyPM2.5 <- all.2019.dailyPM2.5 %>%
  mutate(site_id = paste0(state_code,county_code,site_number))
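As a sanity check on the concatenated key: per Census FIPS and AQS conventions, the state code is 2 characters, the county code 3, and the AQS site number 4, so every site_id should be 9 characters. On hypothetical component values:

```r
# Fixed-width components: state FIPS (2) + county FIPS (3) + AQS site number (4)
state_code  <- "06"    # hypothetical state FIPS (California)
county_code <- "079"   # hypothetical county FIPS
site_number <- "0008"  # hypothetical AQS site number
site_id <- paste0(state_code, county_code, site_number)
nchar(site_id)  # 9
```

Because the components arrive as zero-padded character strings from the API, the concatenation preserves fixed widths without extra formatting.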

Using the tigris::tracts function, I retrieve the geometries, GEOIDs, and other spatial information for census tracts and join each AQS testing site to the data for the tract within which it lies, based on the latitude and longitude from AQS. I loop through the state FIPS codes one at a time and stack the results into a single crosswalk file. This step is time-consuming, so I save the resulting file and load it in the next step.

# Prepare for spatial merge. Create sf file from testing site coordinates
testing.sites.sf.points <- st_as_sf(testing.sites, coords = c('longitude','latitude'), crs = 4269)

# Download census tract sf shapefiles, one state at a time.
# Perform spatial joins one state at a time to avoid having to build the national tract shapefile.
for (s in seq_along(lower48.codes)) {
  newstate <- lower48.codes[s]
  newtracts <- tracts(newstate)
  newjoin <- st_join(newtracts, testing.sites.sf.points, join = st_contains, left=FALSE)
  if (s==1){
    testing.sites.tracts <- newjoin
  } else {
    testing.sites.tracts <- bind_rows(testing.sites.tracts,newjoin)
  }
}

# Save file with testing sites <> census tract
saveRDS(testing.sites.tracts, file = "testing_sites_tracts.RDS")
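Growing the accumulator with bind_rows inside the loop works, but it re-copies the accumulated result on every iteration. An alternative pattern, sketched here on toy data in base R, collects the pieces in a pre-allocated list and combines them once at the end:

```r
# Collect per-iteration results in a list, then combine once
pieces <- vector("list", 3)
for (i in 1:3) {
  pieces[[i]] <- data.frame(state = i, rows = i * 10)
}
combined <- do.call(rbind, pieces)
nrow(combined)  # 3
```

With dplyr loaded, `bind_rows(pieces)` is equivalent and also tolerates columns that differ across pieces. For 49 iterations the difference is minor; for longer loops it matters.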

From tidycensus I now retrieve the variable list for the 2018 5-year American Community Survey. After reviewing the concepts from those variables as well as the questionnaire, I create an object with a set of variables to test for associations with the AQS data.

testing.sites.tracts <- readRDS("testing_sites_tracts.RDS")
head(testing.sites.tracts)

# Load a data frame with all 5-year ACS variables
census.5yr.vars <- load_variables(dataset = "acs5", year = 2018)
head(census.5yr.vars)

# Simplify the variable list to make it easier to review the major concepts represented
census.5yr.concept.counts <- 
  census.5yr.vars %>% 
  mutate(root.concept = str_split_fixed(concept," BY",2)[,1]) %>%
  mutate(root.concept = str_split_fixed(root.concept," \\(",2)[,1]) %>% 
      # Double escape needed to match the open parenthesis
  mutate(root.name = substring(name,1,6)) %>%
  group_by(root.name,root.concept) %>%
  summarise(obs=n(), .groups="keep") %>%
  arrange(root.name,-obs)

# View(sort(unique(census.5yr.concept.counts$root.concept)))

# After reviewing the variable list, create an object with the ones of interest
final.acs.var.list <- census.5yr.vars %>% filter(name %in% c(
     "B01002_001" # median age
    ,"B03002_003","B03002_001" # Non-Hisp White alone // Denom
    ,"B02009_001","B02001_001" # Black race, alone or in combination // Denom
    ,"B06008_003","B06008_001" # Now married // Denom
    ,"B23007_002","B23007_001" # Children under 18yrs in household // Denom
    ,"B15003_022","B15003_023","B15003_024","B15003_025","B15003_001" # Bachelors // Masters // Professional // Doctorate // Denom
    ,"B23022_027","B23022_026","B23022_003","B23022_002" # Worked in past 12 months-Female // Denom-Female // p12-Male // Denom-Male
    ,"B21001_002","B21001_001" # Veteran // Denom
    ,"B17001_002","B17001_001" # Past 12mo income at or below poverty level, Denom
    ,"B22010_002","B22010_001" # Received SNAP in past 12mo, Denom
    ,"B22008_001"  # Median household income, past 12mo
    ,"B25008_003","B25008_001" # Renter occupied, Denom
    ,"B25024_002","B25024_003","B25024_001" # Single-fam, detached // single-fam, attached // Denom
    ,"B25064_001" # Median gross rent
    ,"B25088_002" # Median monthly owner costs for households with a mortgage
    ,"B08126_004","B08126_002","B08126_001" # INDUSTRY: Manufacturing // Agriculture, fishing, mining // Denom
    ,"B08006_003","B08006_001" # Drove to work alone // Denom
    ,"B08013_001" # Aggregate travel time to work in minutes
  )) %>% arrange(name)

final.acs.var.list

Once more I loop through the state FIPS codes, retrieving the ACS variables of interest by tract as a flat file using tidycensus::get_acs and preserving the data only for the census tracts that contain AQS testing sites. As this is another slow step, I save the resulting data frame.

head(testing.sites.tracts)

# Keep only what is needed to make the ACS calls in a small data frame
tst.minimal <- testing.sites.tracts %>% 
                mutate(site_id = paste0(state_code,county_code,site_number)) %>%
                dplyr::select(site_id, GEOID, STATEFP, COUNTYFP)
st_geometry(tst.minimal) <- NULL

# Loop through existing list of state codes
lower48.codes

# Make ACS calls one state at a time, only for counties with AQS testing sites
# Preserve estimate columns, ditch MOE
# Join to testing site list
# Stack ACS data together in a single data frame

acs.loop <- function(){
  for (st in seq_along(lower48.codes)) {
    
    newstate <- lower48.codes[st]
    
    county.list <- tst.minimal %>% 
                  filter(STATEFP==newstate) %>% 
                  arrange(COUNTYFP) %>%
                  pull(COUNTYFP) %>%
                  unique()   # de-duplicate counties with multiple testing sites
    
    results.acs <- get_acs( geography = "tract"
                          ,variables=as.list(final.acs.var.list$name)
                          ,state = newstate
                          ,county = county.list
                          ,output = "wide")
    
    smaller.acs <- results.acs %>% dplyr::select(GEOID,matches(".*_.*E$"))
  
    linked.acs <- tst.minimal %>% 
                filter(STATEFP==newstate) %>%
                left_join(.,smaller.acs,by="GEOID")
    
    if (st==1){
      collect <- linked.acs
    } else {
      collect <- bind_rows(collect,linked.acs)
    }
  }
  return(collect)
}

testing.sites.acs <- acs.loop()
testing.sites.acs

saveRDS(testing.sites.acs, file = "testing_sites_acs.RDS")
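With output = "wide", tidycensus names each estimate column with an E suffix and each margin of error with an M suffix; the matches(".*_.*E$") pattern in the loop relies on this to keep estimates and drop margins of error. A quick base R illustration of the same selection logic:

```r
# Keep only estimate columns: an underscore, then a suffix ending in E
cols <- c("GEOID", "NAME", "B01002_001E", "B01002_001M")
cols[grepl(".*_.*E$", cols)]  # "B01002_001E"
```

Requiring the underscore before the trailing E is what keeps identifier columns like NAME (which happens to end in E) from slipping through.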

With all the ACS data I need, I can create clean versions of the variables to be included in the analysis. I exclude tracts that do not have complete data for all 19 ACS-based variables or with fewer than 500 respondents in the sample (using the denominator from the set of race variables).

testing.sites.acs <- readRDS("testing_sites_acs.RDS")
head(testing.sites.acs)

# Create percentage variables and scaled medians/totals to use in modeling
testing.sites.acs.ready <- testing.sites.acs %>%
  dplyr::rename( age.median       = B01002_001E
                ,income.hh.median = B22008_001E 
                ,rent.median      = B25064_001E
                ,home.pmt.median  = B25088_002E
                ,commute.time.agg = B08013_001E) %>%
  mutate( race.black.any          = 100*B02009_001E/B02001_001E
         ,ethn.nhisp.white.alone  = 100*B03002_003E/B03002_001E
         ,married                 = 100*B06008_003E/B06008_001E
         ,kids                    = 100*B23007_002E/B23007_001E
         ,bachelor.plus           = 100*(B15003_022E + B15003_023E + B15003_024E + B15003_025E)/B15003_001E
         ,veteran                 = 100*B21001_002E/B21001_001E
         ,manufacturing           = 100*B08126_004E/B08126_001E
         ,farm.fish.mining        = 100*B08126_002E/B08126_001E
         ,commutes.car.alone      = 100*B08006_003E/B08006_001E
         ,renter                  = 100*B25008_003E/B25008_001E
         ,single.fam.home         = 100*(B25024_002E + B25024_003E)/B25024_001E
         ,worked.12mo             = 100*(B23022_027E + B23022_003E)/(B23022_026E + B23022_002E) 
         ,poverty.12mo            = 100*B17001_002E/B17001_001E
         ,snap.12mo               = 100*B22010_002E/B22010_001E
         ,denominator             = B02001_001E) %>%
  dplyr::select(-matches("^B[012].*"))

head(testing.sites.acs.ready)
summary(testing.sites.acs.ready)
# testing.sites.acs.ready %>% dplyr::filter(denominator < 1000) %>% View()
# testing.sites.acs.ready %>% dplyr::filter(denominator < 500 | !complete.cases(.)) %>% View()

length(unique(testing.sites.acs.ready$GEOID))
# 953 - not unique by tract

# 34 tracts to exclude based on incomplete data and total denominator
# Bring the aggregates and medians to a comparable scale
testing.sites.acs.complete <- 
  testing.sites.acs.ready %>% 
  mutate(commute.time.agg  = commute.time.agg/10000
         ,income.hh.median = income.hh.median/10000
         ,rent.median      = rent.median/100
         ,home.pmt.median  = home.pmt.median/100
         ) %>%
  dplyr::rename(commute.time.agg.10k  = commute.time.agg
                ,income.hh.median.10k = income.hh.median
                ,rent.median.100      = rent.median
                ,home.pmt.median.100  = home.pmt.median) %>%
  dplyr::filter(denominator >= 500 & complete.cases(.))

summary(testing.sites.acs.complete)
length(unique(testing.sites.acs.complete$GEOID))
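One hazard in the percentage calculations above: a tract with a zero denominator produces NaN (0/0) or Inf (positive/0) rather than NA, and complete.cases() drops NaN rows but not Inf rows. A small guard function makes the missingness explicit; `pct` is a hypothetical helper, not part of the pipeline:

```r
# Return NA instead of NaN/Inf when the denominator is zero
pct <- function(num, denom) ifelse(denom > 0, 100 * num / denom, NA_real_)

pct(25, 100)  # 25
pct(3, 0)     # NA
pct(0, 0)     # NA
```

In practice the 500-respondent denominator filter already removes the zero-denominator tracts here, but the guard protects against the Inf case if the filter threshold ever changes.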

I join the census tract GEOIDs to the daily PM2.5 data, and clean and summarize the PM2.5 data by tract. Tracts with fewer than 50 days of valid data are excluded.

summary(clean.2019.dailyPM2.5)

# Create a file of site IDs with GEOIDs that have complete ACS data.
site_id.GEOID.xwalk <- testing.sites.acs.complete %>%
                        dplyr::select(site_id,GEOID) %>%
                        unique()

GEOID.dailyPM2.5 <- inner_join( site_id.GEOID.xwalk
                               ,clean.2019.dailyPM2.5
                               ,by="site_id")

dim(GEOID.dailyPM2.5)
# 1,396,734 rows

dim(unique(GEOID.dailyPM2.5[,c("GEOID","date_local")]))
# 243,809 tract-days

table(GEOID.dailyPM2.5$event_type, useNA="always")
# Included, Excluded, None
table(GEOID.dailyPM2.5$validity_indicator, useNA="always")
table(GEOID.dailyPM2.5$validity_indicator, GEOID.dailyPM2.5$event_type, useNA="always")
# Y, N. 4282 x N
# Excluding validity_indicator == "N" does not get rid of a meaningful % of events
# GEOID.dailyPM2.5 %>% dplyr::filter(validity_indicator == "N") %>% View()

summary(GEOID.dailyPM2.5$arithmetic_mean)

# Create preliminary outcome variables
GEOID.daily.summary <-
  GEOID.dailyPM2.5 %>%
  dplyr::filter(validity_indicator=="Y") %>%
  mutate(arithmetic_mean = case_when(arithmetic_mean < 0 ~ 0, TRUE ~ arithmetic_mean)) %>%
  group_by(GEOID,date_local) %>%
  summarise(mean = mean(arithmetic_mean),.groups = "keep")

# table(GEOID.daily.summary$any_event)
summary(GEOID.daily.summary$mean)

GEOID.AQI.outcomes <-
  GEOID.daily.summary %>%
  mutate(ungreen = case_when(mean > 12.0 ~ 1L, TRUE ~ 0L)) %>%
  group_by(GEOID) %>%
  summarise( pm2.5_days         = n()
            ,pm2.5_p50          = quantile(mean, 0.50)
            ,pm2.5_ungreen_days = sum(ungreen)
            ,.groups="keep")

head(GEOID.AQI.outcomes)
summary(GEOID.AQI.outcomes)

# How many days of valid data do we have per tract?
table(GEOID.AQI.outcomes$pm2.5_days)

# Create final PM2.5 outcome variables
AQI.outcomes.final <-
  GEOID.AQI.outcomes %>%
  dplyr::filter(pm2.5_days >= 50) %>%
  mutate( pm2.5_pct_ungreen    = 100 * pm2.5_ungreen_days / pm2.5_days
         ,pm2.5_flag10_ungreen = case_when(pm2.5_ungreen_days * 10 > pm2.5_days ~ 1L, TRUE ~ 0L)
         ,pm2.5_flag20_ungreen = case_when(pm2.5_ungreen_days *  5 > pm2.5_days ~ 1L, TRUE ~ 0L)
         ) %>%
  mutate( pm2.5_flag10_ungreen = factor(pm2.5_flag10_ungreen
                                        ,levels = c(0,1)
                                        ,labels = c("0-10%",">10%"))
         ,pm2.5_flag20_ungreen = factor(pm2.5_flag20_ungreen
                                        ,levels = c(0,1)
                                        ,labels = c("0-20%",">20%"))) %>%
  dplyr::select(GEOID, pm2.5_p50, pm2.5_pct_ungreen, pm2.5_flag10_ungreen, pm2.5_flag20_ungreen)

# Review created outcome variables
head(AQI.outcomes.final)
summary(AQI.outcomes.final$pm2.5_p50)
summary(AQI.outcomes.final$pm2.5_pct_ungreen)
table(AQI.outcomes.final$pm2.5_flag10_ungreen)
table(AQI.outcomes.final$pm2.5_flag20_ungreen)
AQI.outcomes.final %>% dplyr::filter(pm2.5_flag10_ungreen==">10%") %>% summary()
AQI.outcomes.final %>% dplyr::filter(pm2.5_flag20_ungreen==">20%") %>% summary()
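The >10% and >20% flags above use integer arithmetic (ungreen_days * 10 > days and ungreen_days * 5 > days) rather than comparing a derived floating-point percentage to a threshold, which sidesteps rounding edge cases. On hypothetical counts:

```r
# 21 "ungreen" days out of 200 valid days = 10.5%
days    <- 200L
ungreen <- 21L
flag10 <- ungreen * 10L > days   # TRUE  (210 > 200): above the 10% threshold
flag20 <- ungreen * 5L  > days   # FALSE (105 <= 200): not above the 20% threshold
c(flag10, flag20)
```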

With both predictors and outcomes summarized by census tract, the ACS and AQS files can now be joined to create the final data set.

str(AQI.outcomes.final)
str(testing.sites.acs.complete)

# Drop unnecessary geographic variables from the ACS file
acs.joinready <- 
  testing.sites.acs.complete %>%
  dplyr::select(-site_id, -STATEFP, -COUNTYFP) %>%
  unique()

dim(acs.joinready)
length(unique(acs.joinready$GEOID))
# 920 rows, unique on GEOID

# Join PM2.5 and ACS data by tract on GEOID
AQI.ACS <- inner_join(AQI.outcomes.final,acs.joinready,by="GEOID")
head(AQI.ACS)

# Prepare the geography/geometry file for linkage
str(testing.sites.tracts)
geom.joinready <-
  testing.sites.tracts %>%
  dplyr::select(GEOID,NAMELSAD,STATEFP,state,COUNTYFP,county,city,cbsa,cbsa_code,geometry) %>%
  unique()
dim(geom.joinready)
length(unique(geom.joinready$GEOID))
geom.joinready %>% group_by(GEOID) %>% summarise(obs=n(),.groups="keep") %>% arrange(-obs) %>% head()
geom.joinready %>% dplyr::filter(GEOID=="06079012304")

# Hard-code one value to be able to keep city attribute unique by GEOID
geom.joinready2 <- 
  geom.joinready %>%
  mutate(city = case_when(GEOID=="06079012304" ~ "Arroyo Grande", TRUE ~ city)) %>%
  unique()

dim(geom.joinready2)
length(unique(geom.joinready2$GEOID))
head(geom.joinready2)

# Join geography/geometry file with AQS outcomes and ACS variables
AQI.ACS.geom <- inner_join(geom.joinready2,AQI.ACS,by="GEOID") %>% ungroup()
head(AQI.ACS.geom)
class(AQI.ACS.geom)

Results

Describe your results and include relevant tables, plots, and code/comments used to obtain them. End with a brief conclusion of your findings related to the question you set out to address. You can include references if you’d like, but this is not required.

Planned analyses: perform bivariate tests of community characteristics versus air quality (both continuous measures and a binary indicator for poor air quality above a threshold); fit multivariable models predicting air quality from community characteristics; and construct a classifier from community characteristics, evaluating its performance against the observed data.

Continuous PM2.5 outcomes by tract, mapped:

AQI.ACS.geom.WGS84 <- st_transform(AQI.ACS.geom, 4326) # Transform from NAD83 (EPSG:4269) to WGS84 (EPSG:4326) for leaflet

palette <- colorNumeric("viridis", NULL, reverse=TRUE)

# Median PM2.5 by tract - interactive
popup_msg <- paste0(AQI.ACS.geom.WGS84$county," County, ",AQI.ACS.geom.WGS84$state,", ",AQI.ACS.geom.WGS84$NAMELSAD
                   ,"<br>Metro Area: ",AQI.ACS.geom.WGS84$cbsa
                   ,"<br>Median PM2.5: ",round(AQI.ACS.geom.WGS84$pm2.5_p50,digits=2))

leaflet(AQI.ACS.geom.WGS84) %>%
  addPolygons(fillColor = ~palette(pm2.5_p50)
              ,fillOpacity = 0.9
              ,weight = 1.5
              ,color = "black"
              ,popup = popup_msg) %>%      
  addTiles() %>%
  addLegend("bottomright",         
            pal=palette,      
            values=~pm2.5_p50,   
            title = 'PM2.5',  
            opacity = 1) %>%    
  addScaleBar()

# % of days above 12 mcg/m3 by tract - interactive
popup_msg2 <- paste0(AQI.ACS.geom.WGS84$county," County, ",AQI.ACS.geom.WGS84$state,", ",AQI.ACS.geom.WGS84$NAMELSAD
                   ,"<br>Metro Area: ",AQI.ACS.geom.WGS84$cbsa
                   ,"<br>Reached 12mcg/m3 on ",round(AQI.ACS.geom.WGS84$pm2.5_pct_ungreen,digits=1),"% of days")

leaflet(AQI.ACS.geom.WGS84) %>%
  addPolygons(fillColor = ~palette(pm2.5_pct_ungreen)
              ,fillOpacity = 0.9
              ,weight = 1.5
              ,color = "black"
              ,popup = popup_msg2) %>%      
  addTiles() %>%
  addLegend("bottomright",         
            pal=palette,      
            values=~pm2.5_pct_ungreen,   
            title = '% days of 12+ mcg/m3 PM2.5',  
            opacity = 1) %>%    
  addScaleBar()

Regression results, using all 19 ACS-based predictor variables.

# Run linear regression for continuous outcomes
lm.p50 <- lm(pm2.5_p50 ~ 
               age.median + commute.time.agg.10k + income.hh.median.10k + rent.median.100 
             + home.pmt.median.100 + race.black.any + ethn.nhisp.white.alone + married + kids 
             + bachelor.plus + veteran + manufacturing + farm.fish.mining + commutes.car.alone 
             + renter + single.fam.home + worked.12mo + poverty.12mo + snap.12mo
             ,data = AQI.ACS.geom)

lm.pct.ungreen <- lm(pm2.5_pct_ungreen ~ 
               age.median + commute.time.agg.10k + income.hh.median.10k + rent.median.100 
             + home.pmt.median.100 + race.black.any + ethn.nhisp.white.alone + married + kids 
             + bachelor.plus + veteran + manufacturing + farm.fish.mining + commutes.car.alone 
             + renter + single.fam.home + worked.12mo + poverty.12mo + snap.12mo
             ,data = AQI.ACS.geom)

# Run logistic regression for binary outcomes
logit.flag10.ungreen <- glm(pm2.5_flag10_ungreen ~ 
               age.median + commute.time.agg.10k + income.hh.median.10k + rent.median.100 
             + home.pmt.median.100 + race.black.any + ethn.nhisp.white.alone + married + kids 
             + bachelor.plus + veteran + manufacturing + farm.fish.mining + commutes.car.alone 
             + renter + single.fam.home + worked.12mo + poverty.12mo + snap.12mo
             ,data = AQI.ACS.geom
             ,family = binomial())

logit.flag20.ungreen <- glm(pm2.5_flag20_ungreen ~ 
               age.median + commute.time.agg.10k + income.hh.median.10k + rent.median.100 
             + home.pmt.median.100 + race.black.any + ethn.nhisp.white.alone + married + kids 
             + bachelor.plus + veteran + manufacturing + farm.fish.mining + commutes.car.alone 
             + renter + single.fam.home + worked.12mo + poverty.12mo + snap.12mo
             ,data = AQI.ACS.geom
             ,family = binomial())
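The per-coefficient p-values for each fitted model live in summary()$coefficients, with the intercept in the first row. The extraction pattern used below, demonstrated on a toy model with simulated data:

```r
# Fit a small model on simulated data and pull its p-value column
set.seed(1)
toy <- data.frame(y = rnorm(50), x1 = rnorm(50), x2 = rnorm(50))
toy.fit <- lm(y ~ x1 + x2, data = toy)

# Column 4 of the coefficient matrix is Pr(>|t|); [-1, ] drops the intercept row
toy.pvals <- summary(toy.fit)$coefficients[-1, 4]
data.frame(acs.var = names(toy.pvals), pval = unname(toy.pvals))
```

The row names of the coefficient matrix carry the predictor names, which is what makes the later inner_join on acs.var possible.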

# Extract p-values by predictor variable from regression results into a summary data frame.
lm.p50.pval.df <- data.frame(acs.var = names(summary(lm.p50)$coefficients[-1,4])
                             ,lm.p50.pval  = summary(lm.p50)$coefficients[-1,4])

lm.pct.ungreen.pval.df <- data.frame(acs.var = names(summary(lm.pct.ungreen)$coefficients[-1,4])
                                    ,lm.pct.ungreen.pval   = summary(lm.pct.ungreen)$coefficients[-1,4])

logit.flag10.ungreen.pval.df <- data.frame(acs.var = names(summary(logit.flag10.ungreen)$coefficients[-1,4])
                                    ,logit.flag10.ungreen.pval = summary(logit.flag10.ungreen)$coefficients[-1,4])

logit.flag20.ungreen.pval.df <- data.frame(acs.var = names(summary(logit.flag20.ungreen)$coefficients[-1,4])
                                    ,logit.flag20.ungreen.pval = summary(logit.flag20.ungreen)$coefficients[-1,4])

# Join p-value data frames together by ACS-based variable name. 
# Sort ascending by p-value for median PM2.5 regression
pval1 <- inner_join(lm.p50.pval.df,lm.pct.ungreen.pval.df, by="acs.var")
pval2 <- inner_join(pval1,logit.flag10.ungreen.pval.df, by="acs.var")
pval3 <- inner_join(pval2,logit.flag20.ungreen.pval.df, by="acs.var") %>% arrange(lm.p50.pval)

# Return results
summary(lm.p50)
summary(lm.pct.ungreen)
summary(logit.flag10.ungreen)
summary(logit.flag20.ungreen)
pval3
## 
## Call:
## lm(formula = pm2.5_p50 ~ age.median + commute.time.agg.10k + 
##     income.hh.median.10k + rent.median.100 + home.pmt.median.100 + 
##     race.black.any + ethn.nhisp.white.alone + married + kids + 
##     bachelor.plus + veteran + manufacturing + farm.fish.mining + 
##     commutes.car.alone + renter + single.fam.home + worked.12mo + 
##     poverty.12mo + snap.12mo, data = AQI.ACS.geom)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -6.3401 -1.0163 -0.0799  1.0319  6.7636 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            10.281525   1.188349   8.652  < 2e-16 ***
## age.median             -0.030768   0.012186  -2.525 0.011746 *  
## commute.time.agg.10k    0.011560   0.020144   0.574 0.566215    
## income.hh.median.10k   -0.020663   0.061624  -0.335 0.737473    
## rent.median.100         0.013211   0.027278   0.484 0.628282    
## home.pmt.median.100    -0.015226   0.018848  -0.808 0.419410    
## race.black.any          0.015403   0.003726   4.134 3.90e-05 ***
## ethn.nhisp.white.alone -0.007592   0.003167  -2.398 0.016705 *  
## married                 0.004083   0.008184   0.499 0.617959    
## kids                   -0.008251   0.006990  -1.180 0.238167    
## bachelor.plus           0.005070   0.006317   0.803 0.422426    
## veteran                -0.046590   0.018786  -2.480 0.013321 *  
## manufacturing           0.031802   0.008779   3.623 0.000308 ***
## farm.fish.mining       -0.046542   0.010051  -4.630 4.19e-06 ***
## commutes.car.alone      0.019526   0.005590   3.493 0.000501 ***
## renter                  0.004082   0.005638   0.724 0.469257    
## single.fam.home        -0.002814   0.003933  -0.716 0.474407    
## worked.12mo            -0.041731   0.008345  -5.000 6.89e-07 ***
## poverty.12mo           -0.011233   0.009332  -1.204 0.229001    
## snap.12mo              -0.002669   0.007966  -0.335 0.737633    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.592 on 889 degrees of freedom
## Multiple R-squared:  0.2524, Adjusted R-squared:  0.2364 
## F-statistic:  15.8 on 19 and 889 DF,  p-value: < 2.2e-16
## 
## 
## Call:
## lm(formula = pm2.5_pct_ungreen ~ age.median + commute.time.agg.10k + 
##     income.hh.median.10k + rent.median.100 + home.pmt.median.100 + 
##     race.black.any + ethn.nhisp.white.alone + married + kids + 
##     bachelor.plus + veteran + manufacturing + farm.fish.mining + 
##     commutes.car.alone + renter + single.fam.home + worked.12mo + 
##     poverty.12mo + snap.12mo, data = AQI.ACS.geom)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -19.932  -5.907  -1.368   4.229  47.795 
## 
## Coefficients:
##                         Estimate Std. Error t value Pr(>|t|)    
## (Intercept)            27.820524   6.444245   4.317 1.76e-05 ***
## age.median             -0.096365   0.066081  -1.458 0.145117    
## commute.time.agg.10k    0.016552   0.109237   0.152 0.879597    
## income.hh.median.10k   -0.211577   0.334180  -0.633 0.526815    
## rent.median.100        -0.005321   0.147927  -0.036 0.971314    
## home.pmt.median.100    -0.061119   0.102213  -0.598 0.550021    
## race.black.any          0.022965   0.020205   1.137 0.256017    
## ethn.nhisp.white.alone -0.050822   0.017172  -2.960 0.003162 ** 
## married                 0.013969   0.044380   0.315 0.753017    
## kids                   -0.008858   0.037905  -0.234 0.815273    
## bachelor.plus           0.021526   0.034253   0.628 0.529887    
## veteran                -0.136190   0.101874  -1.337 0.181615    
## manufacturing           0.172555   0.047607   3.625 0.000306 ***
## farm.fish.mining       -0.134663   0.054507  -2.471 0.013677 *  
## commutes.car.alone      0.025827   0.030314   0.852 0.394454    
## renter                  0.029989   0.030575   0.981 0.326949    
## single.fam.home         0.022962   0.021327   1.077 0.281930    
## worked.12mo            -0.152661   0.045256  -3.373 0.000775 ***
## poverty.12mo           -0.061633   0.050605  -1.218 0.223584    
## snap.12mo               0.021525   0.043200   0.498 0.618417    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 8.632 on 889 degrees of freedom
## Multiple R-squared:  0.1533, Adjusted R-squared:  0.1353 
## F-statistic: 8.475 on 19 and 889 DF,  p-value: < 2.2e-16
## 
## 
## Call:
## glm(formula = pm2.5_flag10_ungreen ~ age.median + commute.time.agg.10k + 
##     income.hh.median.10k + rent.median.100 + home.pmt.median.100 + 
##     race.black.any + ethn.nhisp.white.alone + married + kids + 
##     bachelor.plus + veteran + manufacturing + farm.fish.mining + 
##     commutes.car.alone + renter + single.fam.home + worked.12mo + 
##     poverty.12mo + snap.12mo, family = binomial(), data = AQI.ACS.geom)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.4284  -1.0623   0.5540   0.9632   1.7815  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)    
## (Intercept)             5.0426702  1.7523543   2.878  0.00401 ** 
## age.median             -0.0306630  0.0171606  -1.787  0.07397 .  
## commute.time.agg.10k   -0.0077921  0.0271376  -0.287  0.77401    
## income.hh.median.10k    0.0237597  0.0839018   0.283  0.77704    
## rent.median.100        -0.0340125  0.0371454  -0.916  0.35984    
## home.pmt.median.100    -0.0275504  0.0263048  -1.047  0.29494    
## race.black.any          0.0151165  0.0059452   2.543  0.01100 *  
## ethn.nhisp.white.alone -0.0108105  0.0044240  -2.444  0.01454 *  
## married                 0.0052513  0.0115284   0.456  0.64874    
## kids                   -0.0061786  0.0099806  -0.619  0.53588    
## bachelor.plus           0.0118970  0.0087472   1.360  0.17380    
## veteran                -0.0267337  0.0259119  -1.032  0.30221    
## manufacturing           0.0727012  0.0138298   5.257 1.47e-07 ***
## farm.fish.mining       -0.0343879  0.0146051  -2.355  0.01855 *  
## commutes.car.alone     -0.0048214  0.0079629  -0.605  0.54486    
## renter                  0.0069986  0.0079342   0.882  0.37773    
## single.fam.home        -0.0008233  0.0056277  -0.146  0.88369    
## worked.12mo            -0.0349466  0.0123114  -2.839  0.00453 ** 
## poverty.12mo           -0.0245880  0.0134790  -1.824  0.06813 .  
## snap.12mo               0.0059546  0.0114358   0.521  0.60258    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 1230.0  on 908  degrees of freedom
## Residual deviance: 1078.6  on 889  degrees of freedom
## AIC: 1118.6
## 
## Number of Fisher Scoring iterations: 4
## 
## 
## Call:
## glm(formula = pm2.5_flag20_ungreen ~ age.median + commute.time.agg.10k + 
##     income.hh.median.10k + rent.median.100 + home.pmt.median.100 + 
##     race.black.any + ethn.nhisp.white.alone + married + kids + 
##     bachelor.plus + veteran + manufacturing + farm.fish.mining + 
##     commutes.car.alone + renter + single.fam.home + worked.12mo + 
##     poverty.12mo + snap.12mo, family = binomial(), data = AQI.ACS.geom)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.3347  -0.6637  -0.5104  -0.3763   2.4371  
## 
## Coefficients:
##                          Estimate Std. Error z value Pr(>|z|)   
## (Intercept)             7.962e-01  1.923e+00   0.414  0.67891   
## age.median             -2.691e-03  2.082e-02  -0.129  0.89717   
## commute.time.agg.10k   -1.050e-02  3.917e-02  -0.268  0.78865   
## income.hh.median.10k   -2.337e-01  1.215e-01  -1.923  0.05451 . 
## rent.median.100         4.126e-02  5.053e-02   0.816  0.41426   
## home.pmt.median.100     1.835e-05  3.366e-02   0.001  0.99957   
## race.black.any          1.417e-03  5.585e-03   0.254  0.79970   
## ethn.nhisp.white.alone -9.910e-03  5.076e-03  -1.952  0.05092 . 
## married                 1.540e-03  1.376e-02   0.112  0.91089   
## kids                    7.767e-03  1.155e-02   0.673  0.50125   
## bachelor.plus           1.035e-02  1.109e-02   0.933  0.35085   
## veteran                -1.466e-02  3.223e-02  -0.455  0.64912   
## manufacturing           9.666e-03  1.415e-02   0.683  0.49446   
## farm.fish.mining       -1.214e-02  1.786e-02  -0.679  0.49683   
## commutes.car.alone      1.179e-02  9.554e-03   1.234  0.21711   
## renter                  9.602e-04  9.174e-03   0.105  0.91664   
## single.fam.home         8.781e-03  6.445e-03   1.362  0.17310   
## worked.12mo            -3.752e-02  1.316e-02  -2.850  0.00437 **
## poverty.12mo           -5.823e-03  1.521e-02  -0.383  0.70187   
## snap.12mo              -3.912e-03  1.249e-02  -0.313  0.75423   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 870.13  on 908  degrees of freedom
## Residual deviance: 802.87  on 889  degrees of freedom
## AIC: 842.87
## 
## Number of Fisher Scoring iterations: 5
## 
##                   acs.var  lm.p50.pval lm.pct.ungreen.pval logit.flag10.ungreen.pval logit.flag20.ungreen.pval
## 1             worked.12mo 6.892815e-07        0.0007749118              4.531875e-03               0.004367877
## 2        farm.fish.mining 4.193095e-06        0.0136765637              1.854684e-02               0.496829902
## 3          race.black.any 3.901449e-05        0.2560166477              1.100201e-02               0.799699586
## 4           manufacturing 3.082985e-04        0.0003058825              1.465384e-07               0.494461052
## 5      commutes.car.alone 5.012410e-04        0.3944535502              5.448595e-01               0.217107780
## 6              age.median 1.174578e-02        0.1451167688              7.396545e-02               0.897173725
## 7                 veteran 1.332100e-02        0.1816154439              3.022059e-01               0.649115596
## 8  ethn.nhisp.white.alone 1.670511e-02        0.0031615614              1.454114e-02               0.050915996
## 9            poverty.12mo 2.290010e-01        0.2235836814              6.812551e-02               0.701870729
## 10                   kids 2.381670e-01        0.8152731603              5.358795e-01               0.501254312
## 11    home.pmt.median.100 4.194097e-01        0.5500206039              2.949380e-01               0.999565046
## 12          bachelor.plus 4.224260e-01        0.5298874824              1.738009e-01               0.350849549
## 13                 renter 4.692571e-01        0.3269485271              3.777317e-01               0.916636934
## 14        single.fam.home 4.744072e-01        0.2819304914              8.836892e-01               0.173100905
## 15   commute.time.agg.10k 5.662155e-01        0.8795974016              7.740095e-01               0.788645632
## 16                married 6.179593e-01        0.7530172352              6.487405e-01               0.910894353
## 17        rent.median.100 6.282817e-01        0.9713137300              3.598450e-01               0.414260334
## 18   income.hh.median.10k 7.374727e-01        0.5268151364              7.770352e-01               0.054510257
## 19              snap.12mo 7.376328e-01        0.6184171875              6.025790e-01               0.754228687
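A comparison table like the one above can be assembled programmatically rather than by hand. The sketch below uses only base R; the model object names (`fit.lm.p50`, `fit.logit.flag10`) are placeholders for the actual fitted models summarized above.

```r
# Hedged sketch (base R): extract the non-intercept coefficient p-values from
# a fitted lm/glm, then merge across models into one comparison table.
# Column 4 of summary()$coefficients is Pr(>|t|) for lm and Pr(>|z|) for glm.
pval.frame <- function(fit, colname) {
  cf <- summary(fit)$coefficients
  out <- data.frame(acs.var = rownames(cf)[-1], pval = cf[-1, 4],
                    row.names = NULL)
  names(out)[2] <- colname
  out
}

# With placeholder fits (substitute the real model objects):
# pval.table <- merge(pval.frame(fit.lm.p50, "lm.p50.pval"),
#                     pval.frame(fit.logit.flag10, "logit.flag10.ungreen.pval"),
#                     by = "acs.var")
# pval.table[order(pval.table$lm.p50.pval), ]
```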

Bivariate plots of the most strongly associated predictor variables against the median PM2.5 and 10%-of-days outcomes:

# worked.12mo
ggplot(data = AQI.ACS.geom, aes(x = worked.12mo, y = pm2.5_p50)) + 
  geom_point() + geom_smooth(color = "red", method = "lm") + 
  xlab("% Worked in Past 12 Months") +
  ylab("Median Daily PM2.5 Concentration (mcg/m3)")
## `geom_smooth()` using formula 'y ~ x'

ggplot(data = AQI.ACS.geom, aes(x = pm2.5_flag10_ungreen, y = worked.12mo)) +
  geom_violin(fill = "steelblue1", draw_quantiles = c(0.25, 0.5, 0.75)) + 
  xlab("% of Days with PM2.5 >= 12mcg/m3") + 
  ylab("% Worked in Past 12 Months") 

# manufacturing
ggplot(data = AQI.ACS.geom, aes(x = manufacturing, y = pm2.5_p50)) + 
  geom_point() + geom_smooth(color = "red", method = "lm") + 
  xlab("% of Workers in Manufacturing Industry") +
  ylab("Median Daily PM2.5 Concentration (mcg/m3)")
## `geom_smooth()` using formula 'y ~ x'

ggplot(data = AQI.ACS.geom, aes(x = pm2.5_flag10_ungreen, y = manufacturing)) +
  geom_violin(fill = "steelblue1", draw_quantiles = c(0.25, 0.5, 0.75)) + 
  xlab("% of Days with PM2.5 >= 12mcg/m3") + 
  ylab("% of Workers in Manufacturing Industry") 

# farm.fish.mining
ggplot(data = AQI.ACS.geom, aes(x = farm.fish.mining, y = pm2.5_p50)) + 
  geom_point() + geom_smooth(color = "red", method = "lm") + 
  xlab("% of Workers in Agriculture, Forestry,\nFishing & Hunting, and Mining Industry Category") +
  ylab("Median Daily PM2.5 Concentration (mcg/m3)")
## `geom_smooth()` using formula 'y ~ x'

ggplot(data = AQI.ACS.geom, aes(x = pm2.5_flag10_ungreen, y = farm.fish.mining)) +
  geom_violin(fill = "steelblue1", draw_quantiles = c(0.25, 0.5, 0.75)) + 
  xlab("% of Days with PM2.5 >= 12mcg/m3") + 
  ylab("% of Workers in Agriculture, Forestry,\nFishing & Hunting, and Mining Industry Category")
## Warning in regularize.values(x, y, ties, missing(ties), na.rm = na.rm): collapsing to unique 'x' values

# ethn.nhisp.white.alone
ggplot(data = AQI.ACS.geom, aes(x = ethn.nhisp.white.alone, y = pm2.5_p50)) + 
  geom_point() + geom_smooth(color = "red", method = "lm") + 
  xlab("% of Respondents Reporting\nNon-Hispanic Ethnicity and White Race Only") +
  ylab("Median Daily PM2.5 Concentration (mcg/m3)")
## `geom_smooth()` using formula 'y ~ x'

ggplot(data = AQI.ACS.geom, aes(x = pm2.5_flag10_ungreen, y = ethn.nhisp.white.alone)) +
  geom_violin(fill = "steelblue1", draw_quantiles = c(0.25, 0.5, 0.75)) + 
  xlab("% of Days with PM2.5 >= 12mcg/m3") + 
  ylab("% of Respondents Reporting\nNon-Hispanic Ethnicity and White Race Only")

# race.black.any
ggplot(data = AQI.ACS.geom, aes(x = race.black.any, y = pm2.5_p50)) + 
  geom_point() + geom_smooth(color = "red", method = "lm") + 
  xlab("% of Respondents Reporting Black Race\n(with or without other categories)") +
  ylab("Median Daily PM2.5 Concentration (mcg/m3)")
## `geom_smooth()` using formula 'y ~ x'

ggplot(data = AQI.ACS.geom, aes(x = pm2.5_flag10_ungreen, y = race.black.any)) +
  geom_violin(fill = "steelblue1", draw_quantiles = c(0.25, 0.5, 0.75)) + 
  xlab("% of Days with PM2.5 >= 12mcg/m3") + 
  ylab("% of Respondents Reporting Black Race\n(with or without other categories)")
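The scatter-plus-violin pattern repeated above could also be wrapped in a small helper to cut the copy-paste; a sketch, assuming the same `AQI.ACS.geom` columns used in the chunks above:

```r
library(ggplot2)

# Sketch: build the scatter + violin pair for any ACS predictor column.
# `var` is the column name as a string; `label` is the human-readable axis label.
plot.pair <- function(data, var, label) {
  scatter <- ggplot(data, aes(x = .data[[var]], y = pm2.5_p50)) +
    geom_point() + geom_smooth(color = "red", method = "lm") +
    xlab(label) + ylab("Median Daily PM2.5 Concentration (mcg/m3)")
  violin <- ggplot(data, aes(x = pm2.5_flag10_ungreen, y = .data[[var]])) +
    geom_violin(fill = "steelblue1", draw_quantiles = c(0.25, 0.5, 0.75)) +
    xlab("% of Days with PM2.5 >= 12mcg/m3") + ylab(label)
  list(scatter = scatter, violin = violin)
}

# e.g. plot.pair(AQI.ACS.geom, "worked.12mo", "% Worked in Past 12 Months")
```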

Perform 5-fold cross-validation for each of the following models:

* Logistic regression with all 19 ACS-based variables
* Logistic regression with the top 8 variables from the regression analysis above
* Random forest
* Support vector machine with radial kernel

# Randomly assign each tract to one of 5 groups for 5-fold cross-validation
set.seed(1504492)
kgroup <- data.frame(kgroup = sample(1:5, size = nrow(AQI.ACS), replace = TRUE))
AQI.ACS.onlynec <- bind_cols(AQI.ACS, kgroup) %>% 
                    select(-pm2.5_p50, -pm2.5_pct_ungreen, -pm2.5_flag20_ungreen, -denominator)

# Loop through the 5 groups
for (i in 1:5) {

    train <- filter(AQI.ACS.onlynec, kgroup != i)
    test  <- filter(AQI.ACS.onlynec, kgroup == i)
    
  # GLM: All variables (named glm.all to avoid masking the glm() function)
    glm.all <- train %>% glm(pm2.5_flag10_ungreen ~ 
               age.median + commute.time.agg.10k + income.hh.median.10k + rent.median.100 
             + home.pmt.median.100 + race.black.any + ethn.nhisp.white.alone + married + kids 
             + bachelor.plus + veteran + manufacturing + farm.fish.mining + commutes.car.alone 
             + renter + single.fam.home + worked.12mo + poverty.12mo + snap.12mo
             , data = .
             , family = binomial())
    
    glm.predict <- bind_cols(test[,"GEOID"], data.frame(glmpred = predict(glm.all, test, type = "response")))
    
  # GLM: Selected variables
    select.glm <- train %>% glm(pm2.5_flag10_ungreen ~ 
               age.median + race.black.any + ethn.nhisp.white.alone + veteran 
               + manufacturing + farm.fish.mining + worked.12mo + commutes.car.alone
             , data = .
             , family = binomial())
    
    select.predict <- bind_cols(test[,"GEOID"], data.frame(selectpred = predict(select.glm, test, type = "response")))
    
    # Random forest
    rf <- train %>% randomForest(pm2.5_flag10_ungreen ~ 
               age.median + commute.time.agg.10k + income.hh.median.10k + rent.median.100 
             + home.pmt.median.100 + race.black.any + ethn.nhisp.white.alone + married + kids 
             + bachelor.plus + veteran + manufacturing + farm.fish.mining + commutes.car.alone 
             + renter + single.fam.home + worked.12mo + poverty.12mo + snap.12mo
             , data = .
             , ntree = 2000)
    
    rf.predict <- bind_cols(test[,"GEOID"], data.frame(rfpred = predict(rf, test, type = "prob")[,2]))
    
    # Support vector machine with radial kernel (named svm.fit to avoid masking svm())
    svm.fit <- train %>% svm(pm2.5_flag10_ungreen ~ 
               age.median + commute.time.agg.10k + income.hh.median.10k + rent.median.100 
             + home.pmt.median.100 + race.black.any + ethn.nhisp.white.alone + married + kids 
             + bachelor.plus + veteran + manufacturing + farm.fish.mining + commutes.car.alone 
             + renter + single.fam.home + worked.12mo + poverty.12mo + snap.12mo
             , data = .
             , scale = TRUE
             , kernel = "radial"
             , probability = TRUE)
    
    svm.step1 <- predict(svm.fit, test, probability = TRUE)

    svm.predict <- bind_cols(test[,"GEOID"], data.frame(svmpred = attr(svm.step1, "probabilities")[,2]))

    # Join the predictions together by GEOID
    two.predict   <- inner_join(glm.predict, rf.predict, by="GEOID")
    three.predict <- inner_join(two.predict, select.predict, by="GEOID")
    four.predict  <- inner_join(three.predict, svm.predict, by="GEOID")
        
    if (i==1){
      collect.predict <- four.predict
    } else {
      collect.predict <- bind_rows(collect.predict, four.predict)
    }
}

AQI.ACS.preds <- inner_join(AQI.ACS.onlynec, collect.predict, by = "GEOID")
# Calculate AUROC values for each method under 5-fold cross-validation
AQI.ACS.preds %>% roc(pm2.5_flag10_ungreen, rfpred, ci=TRUE)
AQI.ACS.preds %>% roc(pm2.5_flag10_ungreen, glmpred, ci=TRUE)
AQI.ACS.preds %>% roc(pm2.5_flag10_ungreen, selectpred, ci=TRUE)
AQI.ACS.preds %>% roc(pm2.5_flag10_ungreen, svmpred, ci=TRUE)

# Plot ROC curves for each method
plot.roc(AQI.ACS.preds$pm2.5_flag10_ungreen, AQI.ACS.preds$svmpred, col="yellow")
plot.roc(AQI.ACS.preds$pm2.5_flag10_ungreen, AQI.ACS.preds$rfpred, col="darkgreen", add=TRUE)
plot.roc(AQI.ACS.preds$pm2.5_flag10_ungreen, AQI.ACS.preds$glmpred, col="blue", add=TRUE)
plot.roc(AQI.ACS.preds$pm2.5_flag10_ungreen, AQI.ACS.preds$selectpred, col="orange", add=TRUE)
legend("bottomright"
       ,legend = c("SVM (Radial)","Random Forest", "Logistic: All Vars", "Logistic: Select Vars")
       ,col = c("yellow","darkgreen", "blue", "orange")
       ,lwd = 1)

## 
## Call:
## roc.data.frame(data = ., response = pm2.5_flag10_ungreen, predictor = rfpred,     ci = TRUE)
## 
## Data: rfpred in 372 controls (pm2.5_flag10_ungreen 0-10%) < 537 cases (pm2.5_flag10_ungreen >10%).
## Area under the curve: 0.6966
## 95% CI: 0.6618-0.7314 (DeLong)
## 
## Call:
## roc.data.frame(data = ., response = pm2.5_flag10_ungreen, predictor = glmpred,     ci = TRUE)
## 
## Data: glmpred in 372 controls (pm2.5_flag10_ungreen 0-10%) < 537 cases (pm2.5_flag10_ungreen >10%).
## Area under the curve: 0.7062
## 95% CI: 0.6719-0.7405 (DeLong)
## 
## Call:
## roc.data.frame(data = ., response = pm2.5_flag10_ungreen, predictor = selectpred,     ci = TRUE)
## 
## Data: selectpred in 372 controls (pm2.5_flag10_ungreen 0-10%) < 537 cases (pm2.5_flag10_ungreen >10%).
## Area under the curve: 0.7157
## 95% CI: 0.6821-0.7494 (DeLong)
## 
## Call:
## roc.data.frame(data = ., response = pm2.5_flag10_ungreen, predictor = svmpred,     ci = TRUE)
## 
## Data: svmpred in 372 controls (pm2.5_flag10_ungreen 0-10%) < 537 cases (pm2.5_flag10_ungreen >10%).
## Area under the curve: 0.706
## 95% CI: 0.6713-0.7407 (DeLong)
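The four confidence intervals above overlap substantially, suggesting the methods perform similarly. pROC's `roc.test()` can compare a pair of correlated ROC curves formally via the DeLong method; a sketch, assuming the `AQI.ACS.preds` columns used above:

```r
library(pROC)

# Sketch: DeLong test for a difference between two correlated AUROCs computed
# on the same cross-validated predictions (selected-variable logistic vs. SVM)
roc.select <- roc(AQI.ACS.preds$pm2.5_flag10_ungreen, AQI.ACS.preds$selectpred)
roc.svm    <- roc(AQI.ACS.preds$pm2.5_flag10_ungreen, AQI.ACS.preds$svmpred)
roc.test(roc.select, roc.svm, method = "delong", paired = TRUE)
```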